As part of our course Data Science with R, we are working on a business problem in order to enhance our R programming skills, explore various libraries, and improve our visualization techniques. We are working with the IWD dataset, which comprises customers' Likert-scale responses to over 35 questions divided into 10 blocks.
From the dataset we singled out Block-6, which contains individual satisfaction with product groups, and Block-7, which contains individual satisfaction with branches; these relate to customer satisfaction with the individual product groups and with the services provided by the supermarket, respectively.
Block-6 contains 17 questions on customers' satisfaction with individual product groups.
Block-7 contains 10 questions on customers' satisfaction with branch factors such as employee service, cash-register service, ambience, etc.
In this section we will analyse the 17 questions (80 features) present in Block-6 in two approaches:
Considering all the features
Analysis after removing the columns that are mostly NA
Further, we planned to use the k-means algorithm on the questions. To obtain the optimum number of clusters, we applied the Elbow method to the 16-feature dataset and obtained k = 2 as the optimum, as seen in the graph below.
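As a rough illustration of this step, the Elbow method computes the total within-cluster sum of squares for a range of k and looks for the bend. A minimal sketch, using simulated Likert-style responses as a stand-in for the real Block-6 data frame (which is not shown here):

```r
# Elbow method sketch: total within-cluster sum of squares for k = 1..10.
# `block6` is simulated stand-in data, not the real Block-6 data frame.
set.seed(42)
block6 <- as.data.frame(matrix(sample(1:5, 200 * 16, replace = TRUE), ncol = 16))

wss <- sapply(1:10, function(k) {
  kmeans(block6, centers = k, nstart = 25, iter.max = 25)$tot.withinss
})

# The "elbow" is the k after which the curve flattens out.
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

The same loop was reused later on the demographic data, only with the input data frame swapped.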
We performed k-means with k = 2 and plotted a bar plot showing the segmentation of the customers into homogeneous groups: the x-axis shows the features and the y-axis shows the cluster results.
## Overall Snacks Regional Private
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:3.833 1st Qu.:3.989 1st Qu.:4.000 1st Qu.:3.817
## Median :4.167 Median :3.989 Median :4.023 Median :3.817
## Mean :4.109 Mean :3.989 Mean :4.023 Mean :3.817
## 3rd Qu.:4.667 3rd Qu.:3.989 3rd Qu.:4.333 3rd Qu.:3.817
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000
## Vegan Branded FruitVeggies BreadnBaked
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:4.000 1st Qu.:4.000 1st Qu.:3.833 1st Qu.:4.000
## Median :4.285 Median :4.107 Median :4.141 Median :4.165
## Mean :4.285 Mean :4.107 Mean :4.141 Mean :4.165
## 3rd Qu.:5.000 3rd Qu.:4.667 3rd Qu.:4.667 3rd Qu.:4.500
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000
## Flesh Sausages Dairy Sweets
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:4.000 1st Qu.:4.000 1st Qu.:4.000 1st Qu.:4.000
## Median :4.193 Median :4.260 Median :4.333 Median :4.301
## Mean :4.193 Mean :4.260 Mean :4.326 Mean :4.301
## 3rd Qu.:4.667 3rd Qu.:4.833 3rd Qu.:5.000 3rd Qu.:4.833
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000
## Alcoholic SoftDrinks Cosmetics Utensils
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:4.194 1st Qu.:4.000 1st Qu.:4.121 1st Qu.:3.969
## Median :4.194 Median :4.225 Median :4.121 Median :3.969
## Mean :4.194 Mean :4.225 Mean :4.121 Mean :3.969
## 3rd Qu.:4.194 3rd Qu.:4.800 3rd Qu.:4.121 3rd Qu.:3.969
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000
For all 18 questions, we check the count of total responses given by the customers.
From the above plot we can see that the maximum non-NA count is 2303 and the minimum is 468. The questions AlcoholicDrinks, Cosmetics, Organic, Snacks, Utensils, and Vegan have significantly fewer non-NA responses. We eliminate the groups with fewer than 1152 data points, because more than 50% of the rows in these question groups are NAs. To understand the influence of these missing values, we removed the columns with a large number of missing values and repeated the above analysis. Note that in all the columns that were not removed, we fill the NAs with the mean value of the column. This leaves 11 features (questions) in the dataset.
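The filtering and imputation described above can be sketched as follows; `block6` and its column names are illustrative stand-ins for the real data frame:

```r
# Drop question columns where more than 50% of the responses are NA,
# then fill the remaining NAs with the column mean.
block6 <- data.frame(Snacks = c(4, NA, NA, NA),  # 75% NA -> dropped
                     Dairy  = c(4, 5, NA, 3))    # 25% NA -> kept

keep   <- colMeans(is.na(block6)) <= 0.5
block6 <- block6[, keep, drop = FALSE]

block6[] <- lapply(block6, function(col) {
  col[is.na(col)] <- mean(col, na.rm = TRUE)
  col
})
```

On the real data, the same `keep` mask reproduces the 1152-row cutoff, since 1152 non-NA responses correspond to 50% of the rows.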
We again performed k-means with k = 2 and plotted a bar plot showing the segmentation of the customers into homogeneous groups: the x-axis shows the features and the y-axis shows the cluster results.
From the graph, we can see that a high number of customers are between the ages of 46 and 65, and very few customers are younger than 25 or older than 75.
From the bar plot, we can see that the number of female customers is higher than that of male customers, but the difference is not significant.
From the bar plot, we can see that many customers have an income below 1,250 euros, and there is a downward trend: as income goes up, the customer count goes down. Many customers are hesitant to state their income; in such cases the value 99 is recorded, which is not useful for our calculation. We assume that people in the same age group have similar incomes, so for customers with the value 99 we fill in the mean income of their age group.
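A minimal sketch of this imputation step; the column names and values are hypothetical, not taken from the survey frame:

```r
# Replace the "refused" income code 99 with the mean income of the
# customer's age group (illustrative data only).
customers <- data.frame(
  age_group = c("26-45", "26-45", "46-65", "46-65"),
  income    = c(2, 4, 3, 99)
)

customers$income[customers$income == 99] <- NA
customers$income <- ave(customers$income, customers$age_group,
                        FUN = function(x) {
                          x[is.na(x)] <- mean(x, na.rm = TRUE)
                          x
                        })
```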
From the graph, it is clear that almost 90 percent of the customers have no children and very few customers have 1 or 2 children.
From the graph, we can see that most of the customers have 1 or 2 people in the household, and the count goes down as the number of people increases.
From the graph, we can see that the highest number of customers are from the state Nordrhein-Westfalen (around 500) and the lowest from Bremen (around 10 customers).
We performed k-means clustering on our demographic data. We determined the optimal number of clusters using the Elbow method and the silhouette coefficient. The silhouette graph suggests that the optimal number of clusters is either 2 or 3, while the Elbow method shows a bend at 4. Since with the Elbow method we can choose a number of clusters close to the bend, we consider 3 the optimal number of clusters.
We visualized our clustering result using the popular dimensionality-reduction techniques PCA and t-SNE. In the PCA graph the clusters are well separated. In the t-SNE graph there is some overlap between the clusters, but the groups can still be identified easily.
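The PCA view can be reproduced along these lines (the t-SNE view was built analogously with the Rtsne package); the data here is simulated, standing in for the scaled demographic features:

```r
# Project the clustered data onto the first two principal components
# and color the points by their k-means cluster assignment.
set.seed(7)
demo <- scale(matrix(rnorm(300 * 4), ncol = 4))  # stand-in demographic data
km   <- kmeans(demo, centers = 3, nstart = 25)

pca    <- prcomp(demo)
scores <- pca$x[, 1:2]

plot(scores, col = km$cluster, pch = 19,
     xlab = "PC1", ylab = "PC2",
     main = "K-means clusters in PCA space")
```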
| Groups | Cluster | Avg_Age | Avg_Income | Avg_children | Avg_people | Count |
|---|---|---|---|---|---|---|
| group1 | 1 | 5 | 5 | 1 | 2 | 706 |
| group2 | 2 | 5 | 5 | 1 | 2 | 1047 |
| group3 | 3 | 5 | 13 | 2 | 3 | 550 |
The above graphs don't convey much about the behavior of the clusters, but the above table illustrates the similarity between customers. We calculated the mean of the numeric variables (age, income, number of children, and number of people); the mean cannot be calculated for categorical variables such as state and gender. From the table we observe that clusters 1 and 2 show no difference, but by plotting income against state we drew some new insights.
Customer group1 has low to average income and belongs to states 1 to 7 (Baden-Württemberg to Hessen).
Customer group2 has low to average income and they belong to states 8 to 16 (Mecklenburg-Vorpommern to Thüringen).
Customer group3 consists of high-income people across all states.
When decision-makers want to implement a particular strategy, they will focus on a particular subgroup rather than on all customers in general, because needs differ between customers. So it is a good strategy to group customers based on common aspects and implement strategies to improve their satisfaction.
This section focuses on the development of the CSI (Customer Satisfaction Index). It evaluates and assesses the dataset from an angle that supports a comparative analysis of the superstores and products covered.
As the questionnaire has several sections, the sections directly linked to CSI development are sections 11 to 26. The points or criteria that each section focuses on are:
Moreover, some later sections consider three more criteria to evaluate the superstores and products:
Each section addresses a unique category of products that are available and accessible across all the superstores considered in this questionnaire. The details of the sections and the product categories are as follows:
Next comes the scale, or measurement criterion. The scale employed to gauge each question in the above sections is the Likert scale, the most commonly and widely used psychometric scale in questionnaires. In a Likert scale, all the possible options are grouped and normalized on a scale, which can be of any length; the most commonly used Likert scales contain either 5 or 7 points.
In this questionnaire, a 5-point scale is used for the above-mentioned questions, as follows:
The previous section discussed the questions and details necessary for the development of the CSI and for measuring customers' satisfaction. This section highlights the approaches and steps considered in the early analysis of the data.
Initially, without performing any preprocessing (dimensionality reduction, for instance), we plotted all the superstores and products. Fig. 1 illustrates the overall satisfaction level of the stores across every product. As mentioned in the previous section, each product is evaluated using either 3 or 6 questions; therefore, this initial plot takes the average of all the questions for each product (or section) and plots them against the stores. The horizontal axis represents products, the vertical axis shows the satisfaction level, and the lines represent the stores. At first glance it is evident that the majority of the superstores share a similar satisfaction level. For instance, for the product Alkohol, all stores except VMarkt, Combi, and Familia fall approximately in the range 4.0 to 4.25. Similarly, for the other products, the trend suggests that all the stores, except a few, are near each other and form a chain. Nevertheless, a few stores stand out in the graph. For instance, following the trend for VMarkt, it has the lowest satisfaction for 5 (Alkohol, Alkoholfrei, BB, Og, and Suess) out of 15 products. The graph also suggests that the store Andere has the lowest satisfaction level for the product Gegenstaende. This plot is helpful for identifying which stores have high and low satisfaction scores for each product, but it is quite cumbersome and makes it difficult to decode any hidden information.
We then looked at the data from a different angle. Figure 2 uses R's grid feature in that the stores are grouped in a way that helps identify the stores the customers are most satisfied with. The horizontal axis represents products and the vertical axis shows the satisfaction scale. The important aspect of this plot is the use of a color gradient to distinguish different satisfaction levels; because of this, the plot at first glance gives the sense of a heatmap. The gradient colors also make it easy to spot any store or product with a low or high satisfaction level compared to the others. For instance, the store Andere has the lowest satisfaction score, 3.48, for the product Gegenstaende. Also, comparing the store Real with VMarkt, one can easily deduce that customers are more satisfied with Real than with VMarkt. Although this plot helps in distinguishing the stores, it still does not separate the stores with high review counts from those with fewer reviews.
Although we were able to identify some patterns in the last section, differentiating the stores that share the same trend was hard. This section focuses on identifying the hidden patterns in the data. It focuses on reducing the dimensions so that the data can be analyzed more accurately and precisely. To get rid of data that contributes little and is not informative enough for the analysis, we employed a dimensionality-reduction approach, namely PCA (Principal Component Analysis). This section performs PCA in two ways. In the first approach, the products act as the principal components and the stores are plotted across them. In the second, the same approach is repeated with the stores as principal components and the products as the data points.
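The two runs differ only in whether the store-by-product satisfaction matrix is transposed before `prcomp()` is called. A sketch with a simulated matrix (20 stores by 15 products, matching the 15 components reported in the output):

```r
# Two PCA runs on a store x product matrix of mean satisfaction scores.
# `sat` is simulated stand-in data; rows are stores, columns are products.
set.seed(3)
sat <- matrix(runif(20 * 15, min = 3, max = 5), nrow = 20,
              dimnames = list(paste0("store", 1:20), paste0("prod", 1:15)))

pca_products <- prcomp(sat, scale. = TRUE)     # products as variables (cf. Fig. 3)
pca_stores   <- prcomp(t(sat), scale. = TRUE)  # stores as variables (cf. Fig. 4)

biplot(pca_products)  # arrows: variables; labels: observations
```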
In Fig. 3, the principal components refer to the products, which is why 15 principal components are mentioned. In the graph, the red vectors with captions represent the stores, and the black labels represent the products. This graph gives some important insights into the relation between stores and products. For example, VMarkt, nah&frisch, and Feneberg are correlated, as they all grow in a similar direction. Similarly, Jibi Markt and Klaas+Kock form a group, and the rest, except Tegut, Mix Markt, and Hit, grow in almost the same direction and share more information with one another.
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 4.8644 1.73760 1.15834 0.99171 0.97917 0.76054 0.6120
## Proportion of Variance 0.7395 0.09435 0.04193 0.03073 0.02996 0.01808 0.0117
## Cumulative Proportion 0.7395 0.83381 0.87574 0.90647 0.93644 0.95451 0.9662
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 0.55549 0.4898 0.44705 0.36101 0.32815 0.26619 0.15466
## Proportion of Variance 0.00964 0.0075 0.00625 0.00407 0.00337 0.00221 0.00075
## Cumulative Proportion 0.97586 0.9833 0.98960 0.99367 0.99704 0.99925 1.00000
## PC15
## Standard deviation 9.594e-15
## Proportion of Variance 0.000e+00
## Cumulative Proportion 1.000e+00
Similarly, we plotted the PCA for stores and products in Fig. 4, this time with the stores as principal components. This plot also helps in finding out which stores share information and can be helpful in a comparative analysis. For example, Snacks, Gegenstaende, and Vegan are not correlated with the rest of the products. Similarly, looking at the stores, it is evident that the stores in the 2nd quadrant can be compared with each other rather than with stores like VMarkt and Coop.
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 3.1287 1.5034 0.92045 0.79499 0.6124 0.56308 0.47990
## Proportion of Variance 0.6526 0.1507 0.05648 0.04213 0.0250 0.02114 0.01535
## Cumulative Proportion 0.6526 0.8033 0.85975 0.90188 0.9269 0.94802 0.96337
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 0.35962 0.34352 0.3216 0.27778 0.23721 0.17512 0.15564
## Proportion of Variance 0.00862 0.00787 0.0069 0.00514 0.00375 0.00204 0.00161
## Cumulative Proportion 0.97200 0.97986 0.9868 0.99190 0.99565 0.99770 0.99931
## PC15
## Standard deviation 0.10153
## Proportion of Variance 0.00069
## Cumulative Proportion 1.00000
The PCA plots above provide a decent overview of the stores and products that are interconnected and that would yield more insight when plotted together than with those that load mainly on different principal components. Therefore, to dig further into which stores should be grouped together, we tried the following plots.
In this regard, Fig. 5 plots the stores and their review counts. As is evident from the plot, the data is quite imbalanced in terms of reviews: some stores have far more reviews than others, and there is a significant difference between the stores with many reviews and those with few.
We decided to use a threshold value to split the stores into two sections. The first group contains the stores with a significant number of reviews, and the second group contains the stores with fewer reviews. The problem is how to choose the cut-off value. We again referred to the PCA plot where the stores are the principal components: the clusters there suggest that the stores with fewer than 200 reviews can be grouped together, and the rest can be moved to the first group.
Since the above plot helped in splitting the stores, we repeated the same process for the products. With the help of this plot, Fig. 6, and the PCA plot, Fig. 3, the cut-off value we arrived at is 1500: products with fewer than 1500 reviews belong to the second group, and the rest belong to the first group.
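A sketch of the resulting split, using a few counts that appear elsewhere in this document purely for illustration (the group labels are our own):

```r
# Split stores at 200 reviews and products at 1500 reviews.
store_counts   <- c(Edeka = 333, Real = 60, `Jibi Markt` = 1)  # sample counts
product_counts <- c(Molke = 2303, Vegan = 468)                 # sample counts

store_group   <- ifelse(store_counts >= 200, "many_reviews", "few_reviews")
product_group <- ifelse(product_counts >= 1500, "many_reviews", "few_reviews")
```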
This section concentrates on a detailed analysis of the stores and products. In the last section we split the stores and products into four groups, as follows:
The first two plots, Fig. 7 and Fig. 8, plot the stores with review counts above the threshold against both product groups. These plots use R's facet_wrap functionality to exploit data patterns and display the data in a more grouped form. Each sub-grid represents the satisfaction level of one store across all available product categories. The horizontal axis represents the product categories, and the vertical axis shows the satisfaction level for the respective category in each sub-grid. These plots also use different colors for different satisfaction levels, which makes it comparatively easy to distinguish the stores that are popular and have better customer reviews. In the first plot, Fig. 7, where all the stores have more than 200 reviews, the stores follow a similar pattern: customers' reviews for Molke are high across all stores, and for Regional they are low, except for Real, where customers recorded better reviews than for the rest. Furthermore, judging by the steepness of the lines, the stores Real and Norma have a comparatively less steep trend than the rest.
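The faceting itself can be sketched with ggplot2 as follows; `sat_long` is a hypothetical long-format frame of mean satisfaction scores, not the real data:

```r
library(ggplot2)

# One facet per store; points and lines colored by satisfaction level.
sat_long <- data.frame(
  store        = rep(c("Real", "Norma", "Kaufland"), each = 3),
  product      = rep(c("Molke", "Wurst", "Regional"), times = 3),
  satisfaction = c(4.3, 4.1, 3.8, 4.2, 4.0, 3.4, 4.1, 3.9, 3.3)
)

p <- ggplot(sat_long,
            aes(x = product, y = satisfaction,
                color = satisfaction, group = store)) +
  geom_line() +
  geom_point() +
  facet_wrap(~ store)
```

Printing `p` renders one sub-grid per store, as in Fig. 7 and Fig. 8.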
In Fig. 8, where the same stores are plotted against the products with fewer than 1500 reviews, the trend is similar: all the stores have a better satisfaction level for Alkohol and a relatively poor one for Vegan. Contrary to the last plot, Amazon has a better satisfaction level than the others, except for Real, with which customers seemed to be even more satisfied.
Now let’s plot stores and products data for Group 2 and Group 4. Fig. 9 plots stores having reviews’ count less than 200 against the products that have reviews’ count more than 1500. At a first glance it is evident that some stores have high satisfaction levels than others. Stores Alnatura, Bofrost, Familia, Globus, and Nettoschwarz have satisfaction level above 4.0 for all the mentioned products. At downside store VMarkt, has the lowest satisfaction level as it has no points that are marked as green color. VMarktis followed by store Coop and MixMarkt respectively. The important point that makes this plot different to the plots in the last section, where each product shared almost similar satisfaction level, is, in this plot one can easily distinguish the stores with the help of the satisfaction level.
Fig. 10 plots the data for Group 4. This plot follows the same trend as the last one: some stores have better satisfaction levels. For instance, the stores nah&frisch and Alnatura have better satisfaction levels for every product, while the stores with bad reviews are Andere and Mix Markt. As in the previous plot, a store can be recommended for the whole product list; for example, the store Alnatura can be suggested if a customer wants to buy the mentioned products.
The last section focused on the stores and helped identify the stores with better satisfaction levels. This section considers the same four groups but concentrates on a detailed analysis of the data from the products' perspective.
The first two plots, Fig. 11 and Fig. 12, plot the products with review counts above the threshold against both store groups. These plots again use R's facet_wrap functionality to exploit data patterns and display the data in a more grouped form. Each sub-grid represents the satisfaction level of one product across all available stores. The horizontal axis represents the stores, and the vertical axis shows the satisfaction level in each sub-grid. Different colors mark different satisfaction levels, which makes it comparatively easy to distinguish the products that are popular and have better customer reviews. In the first plot, where all the products have more than 1500 reviews, the product Molke is the most popular, followed by Suess, Eigenmarke, Alkoholfrei, and Wurst. The product with relatively bad reviews is Regional, followed by Marken and Og.
In the second plot, Fig. 12, where customers recorded fewer reviews for the products, Alkohol has the best reviews and Vegan comparatively the worst.
Similarly, the same approach is used for the stores with fewer reviews than the agreed threshold. In these plots, Fig. 13 and Fig. 14, the stores for which customers recorded fewer reviews are plotted against all the products. In data science it is hard to find patterns when there is not enough data, and that applies to these two plots. In the first plot, Fig. 13, all the products share almost the same trend; although the product Molke does better for the first couple of stores, overall the plot does not give much insight. Fig. 14 does follow a pattern: moving through the products, the satisfaction level decreases, but there is still no clear pattern.
Finally, we tried to perform the exploratory analysis from a different angle. As discussed, each product is evaluated using either 3 or 6 different questions, which belong to the following sections:
1. Diversity or range of products
2. Freshness of products
3. Goods availability
4. Presentation of goods
5. Price and performance
6. Quality of products
In Fig. 15 we incorporated the above categories and performed the analysis in this direction. The horizontal axis represents the stores and the vertical axis shows the satisfaction level of each store for the question categories mentioned above.
The underlying motivation for this analysis is to find the categories that customers give more importance to. As is evident from the plot, no customers recorded reviews for questions belonging to “Diversity & Range”, “Goods Availability”, and “Price & Performance”.
For the recommendation task we treated this as a multi-class classification problem: given a customer's responses, predict their main store. From the previous analysis, we selected the best features that help us understand the customer mindset.
In the data, we have 44 questions and their responses from the customers. Each question can be taken as a feature, or its sub-questions can be taken as separate features. Based on the analysis done before, the features selected below capture the customer mindset and are used to predict the store for a new user.
After extracting the respective data frames for the columns selected above, and since most of them are haven-labelled, we extracted average values for some questions and the raw values for the sub-questions, as follows.
As the average values were computed from the question responses, we replaced the NAs with the value 0.
The labels for the dataset are the store names that customers selected as their main store. As the responses are haven-labelled, we extracted the store names using as_factor(), as shown below.
store_cols <- c("F3_Haupteinkaufsstaette")
stores_df <- customerdata[store_cols]
stores_df = as_factor(stores_df)
head(stores_df)
## # A tibble: 6 x 1
## F3_Haupteinkaufsstaette
## <fct>
## 1 Netto (Rot)
## 2 Rewe
## 3 Rewe
## 4 Aldi Nord
## 5 Rewe
## 6 Rewe
The extracted features and labels are combined into the final dataset below. A few labels with no data have been dropped from the final dataset; a glimpse of the dataset follows.
data_set <- cbind(cust_loyality, br_loc_sat, findability, price_setting, quality_setting,
assortment, quality, frische, value_for_money, availabity,
presentation_goods,
stores_df)
data_set <- droplevels(data_set)
head(data_set)
## cust_loyality br_loc_sat findability price_setting quality_setting assortment
## 1 2.666667 2.333333 2.0 4.75 4.2 4
## 2 4.666667 3.333333 5.0 2.50 4.6 5
## 3 4.000000 5.000000 3.0 3.50 3.6 4
## 4 4.333333 2.666667 5.0 5.00 4.4 5
## 5 4.000000 4.666667 3.0 3.50 3.8 5
## 6 4.333333 3.333333 3.5 2.75 2.4 5
## quality frische value_for_money availabity presentation_goods
## 1 4 4 4 3 3
## 2 5 5 4 5 5
## 3 5 4 5 4 5
## 4 5 5 5 5 5
## 5 4 4 3 4 5
## 6 5 5 2 5 5
## F3_Haupteinkaufsstaette
## 1 Netto (Rot)
## 2 Rewe
## 3 Rewe
## 4 Aldi Nord
## 5 Rewe
## 6 Rewe
We used the caret library for the machine-learning algorithms and partitioned the data into 80% for training and 20% for validation. After partitioning, the training dataset contained 1854 samples. The distributions of the features and of the labels are shown below.
# create a list of 80% of the rows in the original dataset we can use for training
validation_index <- createDataPartition(data_set$F3_Haupteinkaufsstaette, p=0.80, list=FALSE)
# select 20% of the data for validation
validation <- data_set[-validation_index,]
# use the remaining 80% of data to training and testing the models
data_set <- data_set[validation_index,]
################### Analysing Dataset ##########################
dim(data_set)
## [1] 1854 12
# list types for each attribute
sapply(data_set, class)
## cust_loyality br_loc_sat findability
## "numeric" "numeric" "numeric"
## price_setting quality_setting assortment
## "numeric" "numeric" "numeric"
## quality frische value_for_money
## "numeric" "numeric" "numeric"
## availabity presentation_goods F3_Haupteinkaufsstaette
## "numeric" "numeric" "factor"
#Summarize the class distribution
percentage <- prop.table(table(data_set$F3_Haupteinkaufsstaette)) * 100
cbind(freq=table(data_set$F3_Haupteinkaufsstaette), percentage=percentage)
## freq percentage
## Edeka 333 17.96116505
## Rewe 307 16.55879180
## Lidl 271 14.61704423
## Aldi Süd 145 7.82092772
## Kaufland 220 11.86623517
## Netto (Rot) 174 9.38511327
## Aldi Nord 107 5.77130529
## Amazon 4 0.21574973
## Real 60 3.23624595
## Penny 88 4.74649407
## DM 7 0.37756203
## Rossmann 4 0.21574973
## Globus 21 1.13268608
## Norma 27 1.45631068
## Combi 5 0.26968716
## Jibi Markt 1 0.05393743
## Hit 7 0.37756203
## Netto (Schwarz) 8 0.43149946
## Tegut 9 0.48543689
## Klaas+Kock 1 0.05393743
## Denn's 6 0.32362460
## Alnatura 4 0.21574973
## Budnikowsky 1 0.05393743
## Feneberg 1 0.05393743
## Famila 9 0.48543689
## Markant 2 0.10787487
## nah&frisch 1 0.05393743
## Mix Markt 1 0.05393743
## Andere 30 1.61812298
# summarize attribute distributions
summary(data_set)
## cust_loyality br_loc_sat findability price_setting
## Min. :1.000 Min. :0.3333 Min. :1.000 Min. :1.000
## 1st Qu.:3.333 1st Qu.:3.0000 1st Qu.:3.500 1st Qu.:3.000
## Median :4.000 Median :3.3333 Median :4.000 Median :3.750
## Mean :3.877 Mean :3.6223 Mean :3.934 Mean :3.622
## 3rd Qu.:4.333 3rd Qu.:4.6667 3rd Qu.:4.500 3rd Qu.:4.250
## Max. :5.000 Max. :5.0000 Max. :5.000 Max. :5.000
##
## quality_setting assortment quality frische
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:3.400 1st Qu.:4.000 1st Qu.:4.000 1st Qu.:4.000
## Median :3.600 Median :4.000 Median :4.000 Median :4.000
## Mean :3.692 Mean :4.132 Mean :4.247 Mean :4.179
## 3rd Qu.:4.000 3rd Qu.:5.000 3rd Qu.:5.000 3rd Qu.:5.000
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000
##
## value_for_money availabity presentation_goods F3_Haupteinkaufsstaette
## Min. :1.000 Min. :1.000 Min. :1.000 Edeka :333
## 1st Qu.:4.000 1st Qu.:3.000 1st Qu.:4.000 Rewe :307
## Median :4.000 Median :4.000 Median :4.000 Lidl :271
## Mean :4.079 Mean :3.978 Mean :4.017 Kaufland :220
## 3rd Qu.:5.000 3rd Qu.:5.000 3rd Qu.:5.000 Netto (Rot):174
## Max. :5.000 Max. :5.000 Max. :5.000 Aldi Süd :145
## (Other) :404
x <- data_set[,1:11]
y <- data_set[,12]
head(y)
## [1] Netto (Rot) Rewe Rewe Aldi Süd Edeka Norma
## 29 Levels: Edeka Rewe Lidl Aldi Süd Kaufland Netto (Rot) Aldi Nord ... Andere
levels(y)
## [1] "Edeka" "Rewe" "Lidl" "Aldi Süd"
## [5] "Kaufland" "Netto (Rot)" "Aldi Nord" "Amazon"
## [9] "Real" "Penny" "DM" "Rossmann"
## [13] "Globus" "Norma" "Combi" "Jibi Markt"
## [17] "Hit" "Netto (Schwarz)" "Tegut" "Klaas+Kock"
## [21] "Denn's" "Alnatura" "Budnikowsky" "Feneberg"
## [25] "Famila" "Markant" "nah&frisch" "Mix Markt"
## [29] "Andere"
# boxplot for each attribute on one image
par(mfrow=c(1,5))
for(i in 1:5) {
boxplot(x[,i], main=names(data_set)[i])
}
# barplot for class breakdown
plot(y)
Work by Data Science Cubs